--- name: pydeseq2 description: Differential gene expression analysis for bulk RNA-seq count matrices using a DESeq2-like workflow in Python; use when you need Wald tests, FDR correction, and optional LFC shrinkage for condition/batch/covariate designs. license: MIT author: aipoch --- > **Source**: [https://github.com/aipoch/medical-research-skills](https://github.com/aipoch/medical-research-skills) ## When to Use Use this skill when you need to run DESeq2-style differential expression in Python, especially in these scenarios: 1. **Case vs control bulk RNA-seq** from a raw integer count matrix (e.g., treated vs control). 2. **Multi-factor designs** to adjust for batch effects or covariates (e.g., `~ batch + condition`, `~ age + condition`). 3. **DESeq2 migration** when converting an R DESeq2 workflow into a Python pipeline. 4. **Pipeline integration** where results must stay in Python objects (pandas/AnnData) for downstream QC, plots, or reporting. 5. **Requests mentioning** “DESeq2”, “differential expression”, “Wald test”, “FDR/padj”, “volcano plot”, “MA plot”, or “PyDESeq2”. ## Key Features - End-to-end DESeq2-like workflow: normalization (size factors), dispersion estimation/shrinkage, LFC fitting, outlier handling. - **Wald tests** for differential expression with **Benjamini–Hochberg FDR** (`padj`). - **Design formulas** in Wilkinson/R-style notation (single-factor and multi-factor). - **Contrast-based comparisons**: `[variable, test_group, reference_group]`. - Optional **Cook’s distance** outlier filtering and refitting. - Optional **LFC shrinkage** (apeGLM-style) for visualization/ranking. - Works naturally with **pandas** and can interoperate with **AnnData**. ## Dependencies Minimum environment (as documented in the source material): - Python **3.10–3.11** - `pydeseq2` (install via pip/uv) - `pandas` **>= 1.4.3** - `numpy` **>= 1.23.0** - `scipy` **>= 1.11.0** - `scikit-learn` **>= 1.1.1** - `anndata` **>= 0.8.0** (optional, for AnnData I/O) Optional plotting: - `matplotlib` (recommended) - `seaborn` (optional) Installation: ```bash uv pip install pydeseq2 ``` ## Example Usage The following script is a complete, runnable example for a standard treated-vs-control analysis. ```python import pandas as pd import numpy as np from pydeseq2.dds import DeseqDataSet from pydeseq2.ds import DeseqStats # ----------------------------- # 1) Load inputs # ----------------------------- # counts.csv is commonly stored as genes x samples; transpose to samples x genes. counts_df = pd.read_csv("counts.csv", index_col=0).T metadata = pd.read_csv("metadata.csv", index_col=0) # Ensure sample alignment common = counts_df.index.intersection(metadata.index) counts_df = counts_df.loc[common] metadata = metadata.loc[common] # ----------------------------- # 2) Basic filtering # ----------------------------- # Remove genes with very low total counts min_total_counts = 10 genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= min_total_counts] counts_df = counts_df[genes_to_keep] # Drop samples with missing condition metadata = metadata.dropna(subset=["condition"]) counts_df = counts_df.loc[metadata.index] # ----------------------------- # 3) Fit DESeq2 model # ----------------------------- dds = DeseqDataSet( counts=counts_df, metadata=metadata, design="~ condition", refit_cooks=True, n_cpus=1, ) dds.deseq2() # ----------------------------- # 4) Wald test with contrast # ----------------------------- ds = DeseqStats( dds, contrast=["condition", "treated", "control"], alpha=0.05, cooks_filter=True, independent_filter=True, ) ds.summary() # ----------------------------- # 5) Results + optional shrinkage # ----------------------------- res = ds.results_df.copy() sig = res[res["padj"] < 0.05].sort_values("padj") print(f"Significant genes (padj < 0.05): {len(sig)}") # Optional: shrink LFC for visualization/ranking (p-values do not change) ds.lfc_shrink() res_shrunk = ds.results_df.copy() # Export res.to_csv("deseq2_results.csv") res_shrunk.to_csv("deseq2_results_shrunk_lfc.csv") sig.to_csv("significant_genes.csv") # ----------------------------- # 6) Minimal volcano plot (optional) # ----------------------------- try: import matplotlib.pyplot as plt plot_df = res.copy() plot_df["neglog10_padj"] = -np.log10(plot_df["padj"].clip(lower=1e-300)) is_sig = plot_df["padj"] < 0.05 plt.figure(figsize=(9, 5)) plt.scatter( plot_df.loc[~is_sig, "log2FoldChange"], plot_df.loc[~is_sig, "neglog10_padj"], s=10, alpha=0.3, c="gray", label="Not significant", ) plt.scatter( plot_df.loc[is_sig, "log2FoldChange"], plot_df.loc[is_sig, "neglog10_padj"], s=10, alpha=0.6, c="red", label="padj < 0.05", ) plt.axhline(-np.log10(0.05), linestyle="--", color="blue", alpha=0.5) plt.xlabel("Log2 Fold Change") plt.ylabel("-Log10(adjusted p-value)") plt.title("Volcano Plot") plt.legend() plt.tight_layout() plt.savefig("volcano_plot.png", dpi=300) except ImportError: pass ``` ## Implementation Details ### Inputs and orientation - **Counts matrix** must be **samples × genes** with **non-negative integer** counts. - Many files are stored as **genes × samples**; transpose with `.T` after loading. ### Design formula (Wilkinson/R-style) - Use strings like: - `~ condition` (single factor) - `~ batch + condition` (batch-adjusted) - `~ age + condition` (continuous covariate) - `~ group + condition + group:condition` (interaction) - Put **adjustment variables first** (e.g., `~ batch + condition`) so the primary effect is interpreted cleanly. ### What `dds.deseq2()` does (high level) The fitting pipeline typically includes: 1. **Size factor** estimation (library-size normalization) 2. **Gene-wise dispersion** estimation 3. Dispersion **trend** fitting and **prior** estimation 4. **MAP dispersion** shrinkage 5. **Log2 fold change** fitting under the specified design 6. **Cook’s distance** outlier detection 7. Optional **refitting** after outlier handling (`refit_cooks=True`) ### Statistical testing and multiple testing correction - `DeseqStats(...).summary()` runs **Wald tests** for the requested coefficient/contrast. - Output columns commonly include: - `baseMean`: mean normalized expression - `log2FoldChange`, `lfcSE`, `stat` - `pvalue`: raw p-value - `padj`: **Benjamini–Hochberg FDR** adjusted p-value - Use `padj < alpha` (commonly 0.05) for significance. ### Contrast specification - Format: `contrast=["variable", "test_group", "reference_group"]` - Example: `["condition", "treated", "control"]` tests treated relative to control. ### LFC shrinkage (optional) - `ds.lfc_shrink()` applies shrinkage to **log2FoldChange** for more stable ranking/plots. - Shrinkage is intended for **visualization and prioritization**; statistical significance is still based on the (unshrunken) Wald test p-values. ### Notes on bundled references/scripts If your repository includes them, use: - `references/api_reference.md` for parameter/object details. - `references/workflow_guide.md` for extended workflows and troubleshooting. - `scripts/run_deseq2_analysis.py` for a CLI-style batch workflow (counts/metadata/design/contrast/output, optional plots).